In this project, we explore a dataset of 1,599 red wines, each described by 11 chemical properties. The dataset also contains a quality score for each wine, rated by at least 3 wine experts. The purpose of this project is to practice EDA (Exploratory Data Analysis) by analyzing the dataset to find out which chemical properties influence the quality of red wines.
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
We first get some overview of the dataset.
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
X is the row number and isn’t necessarily relevant in the exploration. We removed this column.
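The report does not show the code used to drop the column. A minimal sketch, assuming the data sits in a data frame called `wine` (a hypothetical name) and using a two-row stand-in built from the first rows of the `str()` output above:

```r
# Hypothetical stand-in for the full data frame; the values are the first
# two rows shown in the str() output above.
wine <- data.frame(
  X = 1:2,
  fixed.acidity = c(7.4, 7.8),
  quality = c(5L, 5L)
)

# Assigning NULL to a column removes it from the data frame.
wine$X <- NULL

names(wine)
```

An equivalent form is `wine <- subset(wine, select = -X)`, which can be easier to read when several columns are dropped at once.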
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Summary of the dataset:
quality is a discrete variable which ranges from 0 to 10. However, our dataset only contains quality values from 3 to 8.
fixed.acidity, volatile.acidity and citric.acid are different types of acids in wines.
free.sulfur.dioxide is a subset of total.sulfur.dioxide.
Then, we take a look at distributions of single variables.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
We transformed quality into an ordered factor.
## Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
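The conversion can be sketched as follows. The vector is rebuilt from the frequency table above, so the scores come out sorted rather than in the dataset's row order; the variable names are illustrative, not the report's own.

```r
# Rebuild the 1,599 quality scores from the frequency table above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# An ordered factor keeps the ranking 3 < 4 < ... < 8 between levels.
quality_ord <- factor(quality, levels = 3:8, ordered = TRUE)

str(quality_ord)
table(quality_ord)

# Ordered factors support comparisons, e.g. counting wines rated 7 or higher.
sum(quality_ord >= "7")
```

An ordered factor is convenient for grouping in boxplots, but correlation functions such as `cor()` need the numeric scores, so it helps to keep both versions around.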
As the results above show, although quality can range from 0 to 10, the red wines in our dataset only have scores from 3 to 8. Overall, the quality scores are roughly bell-shaped: most wines score 5 or 6, and far fewer score 3, 4, 7 or 8.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution in the above diagram appears long-tailed, so we plot it on a base 10 logarithmic scale.
After the log10 transformation, the distribution looks approximately normal.
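One way to see the effect of the transformation without the plot is to compare the two tails using the summary of fixed.acidity printed above. This is only a rough check based on the five-number summary, and `tail_ratio` is a helper introduced here for illustration.

```r
# Five-number summary of fixed.acidity, copied from the output above.
five_num <- c(min = 4.60, q1 = 7.10, median = 7.90, q3 = 9.20, max = 15.90)

# Ratio of the upper tail length to the lower tail length.
# A value near 1 means the two tails are about equally long.
tail_ratio <- function(x) {
  unname((x["max"] - x["median"]) / (x["median"] - x["min"]))
}

raw_ratio <- tail_ratio(five_num)
log_ratio <- tail_ratio(log10(five_num))

raw_ratio
log_ratio
```

On the raw scale the upper tail is more than twice as long as the lower one, while on the log10 scale the ratio moves much closer to 1, which is consistent with the more symmetric histogram.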
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution in the above diagram appears long-tailed with some outliers.
It looks approximately normal when plotted on a base 10 logarithmic scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The above diagram shows that there are many zero values of citric.acid in our dataset.
## [1] 132
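A count like this can be obtained by summing a logical comparison. The sketch below uses only the first ten citric.acid values from the `str()` output above, so its result differs from the full-data count of 132.

```r
# First ten citric.acid values, copied from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# TRUE counts as 1, so summing the comparison counts the zero entries.
n_zero <- sum(citric_acid == 0)
n_zero

# Share of zero values in this small sample.
mean(citric_acid == 0)
```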
To better understand the different acids in winemaking, I looked up a reference on acids in wine. Fixed acid, volatile acid and citric acid are different types of acids in wine. According to their descriptions, fixed acid covers most of the acids involved with wine and is also called nonvolatile acid; the volatile acid level cannot be too high, otherwise it leads to an unpleasant, vinegar taste; citric acid is usually found in small quantities and can add “freshness” and flavor to wines. It makes sense that the amounts of both volatile and citric acids are much smaller than that of fixed acid. It is possible that for some wines the citric acid level is too small to be detected, or that the value has been rounded to zero.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The above diagram shows that residual.sugar has some extreme outliers. We need to take these outliers into account in further analysis.
Excluding the outliers, the residual.sugar data is closer to a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The chlorides data has some extreme outliers. We need to take them into account in further analysis.
Excluding the outliers, the chlorides data is closer to a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The free.sulfur.dioxide data is skewed to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The total.sulfur.dioxide data is skewed to the right and has some extreme outliers. According to the description of attributes, free.sulfur.dioxide is included in total.sulfur.dioxide, which consists of the free and bound forms of SO2. From the above two diagrams, free.sulfur.dioxide and total.sulfur.dioxide have similar distributions. We are also interested in the bound form, so an additional variable, bound.sulfur.dioxide, will be created to help the investigation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The density is approximately normally distributed, with a few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH is approximately normally distributed, with a few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The sulphates data is long-tailed and has some outliers.
Excluding the outliers, the sulphates data is closer to a normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol level ranges from 8.4% to 14.9%. Many wines have an alcohol level around 9.5%, and the average is 10.42%. From the diagram above, the alcohol data is skewed to the right.
The structure of the dataset has been provided in the “summary of the dataset”.
We are interested in investigating which features influence the quality of wines. By observing the distributions of single variables alone, we cannot tell which feature(s) determine the quality. Further exploration is needed.
We created a new variable, acid, as the sum of fixed.acidity, volatile.acidity and citric.acid to investigate the overall influence of acids on wine quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.270 7.827 8.720 9.118 10.070 17.050
The acid data appears to be approximately normally distributed.
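The acid variable is the sum of the three acid columns, which agrees with the means reported above (8.32 + 0.5278 + 0.271 is about 9.12). A short sketch using the first five rows from the `str()` output, with `wine_head` as a hypothetical name for that sample:

```r
# First five rows of the three acid columns, copied from the str() output.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7),
  citric.acid      = c(0, 0, 0.04, 0.56, 0)
)

# Combined acid variable: element-wise sum of the three acid measurements.
wine_head$acid <- with(wine_head, fixed.acidity + volatile.acidity + citric.acid)

wine_head$acid
```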
bound.sulfur.dioxide is created to investigate the influence of the sulfur dioxide that is not in free form.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 21.00 30.59 39.00 251.50
Just like total.sulfur.dioxide and free.sulfur.dioxide, bound.sulfur.dioxide is skewed to the right with some extreme outliers.
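Since total sulfur dioxide consists of the free and bound forms, the bound part can be derived by subtraction. The report does not show its code, so the sketch below is an assumption, although it agrees with the summaries above (mean 46.47 - 15.87 is about 30.6). It uses the first five rows from the `str()` output.

```r
# First five rows of the two sulfur dioxide columns from the str() output.
so2_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34)
)

# Bound SO2 is assumed to be whatever is left after removing the free form.
so2_head$bound.sulfur.dioxide <-
  so2_head$total.sulfur.dioxide - so2_head$free.sulfur.dioxide

so2_head$bound.sulfur.dioxide
```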
We observed that some variables have long-tailed distributions with outliers. For some of them, we plotted the data on a base 10 logarithmic scale to obtain an approximately normal distribution. Details can be found in the section above. We chose not to tidy or adjust the form of the data here, since we would like to explore relations between variables in the next sections using the original data. We will transform the data in the next sections when needed.
To understand the relations between different variables, we compute and plot correlation between each pair of them.
To get a more intuitive view, we plot the correlations as follows:
Based on the above diagrams, we examined the relations between all variables in the dataset and found relatively strong correlations with wine quality:
Note: Rank based on significance level and strength of correlation.
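A minimal sketch of ranking variables by their correlation with quality is given below. It uses only the first ten rows from the `str()` output and three columns, so the numbers it prints are not the ones the report is based on; `wine_head` is a hypothetical name.

```r
# First ten rows of three columns, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cor_mat <- cor(wine_head)

# Keep the correlations with quality and rank them by absolute strength.
quality_cor <- cor_mat[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
quality_cor <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
quality_cor

# Significance of a single correlation.
alcohol_p <- cor.test(wine_head$alcohol, wine_head$quality)$p.value
alcohol_p
```

Note that `cor()` needs quality as a number here, not as the ordered factor created earlier.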
Then, we further investigate the correlations between quality and some of the transformed variables. Specifically, we compute the correlations using the base 10 logarithmic scale of some variables discussed in the Univariate Plots section.
## alcohol_log10 volatile.acidity_log10
## 0.47698109 -0.39124918
## sulphates_log10 bound.sulfur.dioxide_log10
## 0.30864193 -0.20830074
## chlorides_log10 total.sulfur.dioxide_log10
## -0.17613996 -0.17014272
## fixed.acidity_log10 free.sulfur.dioxide_log10
## 0.11423756 -0.05008749
## residual.sugar_log10
## 0.02353331
We can see that on the base 10 logarithmic scale, some variables have stronger relations with wine quality. With a threshold of 0.3 on the absolute correlation, three variables in the table above stand out: alcohol, volatile.acidity and sulphates (all on the log10 scale here).
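The threshold step can be reproduced directly from the correlations printed above. The vector below copies those values; applying the cutoff to the absolute value is an assumption about how the threshold was used.

```r
# Correlations with quality on the log10 scale, copied from the output above.
log10_cor <- c(
  alcohol_log10              =  0.47698109,
  volatile.acidity_log10     = -0.39124918,
  sulphates_log10            =  0.30864193,
  bound.sulfur.dioxide_log10 = -0.20830074,
  chlorides_log10            = -0.17613996,
  total.sulfur.dioxide_log10 = -0.17014272,
  fixed.acidity_log10        =  0.11423756,
  free.sulfur.dioxide_log10  = -0.05008749,
  residual.sugar_log10       =  0.02353331
)

# Keep variables whose absolute correlation reaches the 0.3 threshold.
selected <- log10_cor[abs(log10_cor) >= 0.3]
names(selected)
```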
We also use boxplots to show how the 9 variables most correlated with quality vary across quality levels, including their outliers.
Density estimates for the 4 variables most correlated with quality.
We would also like to know the correlations between the other variables. It is possible that some variables are highly correlated with each other, and that such variables are either both “good” for wine or both “bad” for wine.
## row column cor p
## 67 fixed.acidity acid 0.9963844 0
## 85 total.sulfur.dioxide bound.sulfur.dioxide 0.9576864 0
## 69 citric.acid acid 0.6904382 0
## 75 pH acid -0.6834838 0
## 29 fixed.acidity pH -0.6829782 0
## 74 density acid 0.6755958 0
## 2 fixed.acidity citric.acid 0.6717035 0
## 22 fixed.acidity density 0.6680470 0
## 21 free.sulfur.dioxide total.sulfur.dioxide 0.6676664 0
## 3 volatile.acidity citric.acid -0.5524957 0
## 31 citric.acid pH -0.5419042 0
## 53 density alcohol -0.4961796 0
## 84 free.sulfur.dioxide bound.sulfur.dioxide 0.4251489 0
## 41 chlorides sulphates 0.3712605 0
## 24 citric.acid density 0.3649471 0
## 25 residual.sugar density 0.3552834 0
## 36 density pH -0.3416989 0
## 39 citric.acid sulphates 0.3127700 0
## 33 chlorides pH -0.2650261 0
## 38 volatile.acidity sulphates -0.2609867 0
The above results show that some pairs of variables have strong correlations.
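A table of this shape (one row per variable pair, with correlation and p-value) can be produced in base R as sketched below. The helper `flatten_cor` is introduced here for illustration and is not necessarily how the report built its table. The sample uses the first ten rows of three columns from the `str()` output.

```r
# First ten rows of three columns, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Turn a correlation matrix into one row per variable pair,
# sorted by absolute correlation, with a p-value from cor.test().
flatten_cor <- function(df) {
  cor_mat <- cor(df)
  upper <- upper.tri(cor_mat)  # each pair once, no diagonal
  out <- data.frame(
    row    = rownames(cor_mat)[row(cor_mat)[upper]],
    column = colnames(cor_mat)[col(cor_mat)[upper]],
    cor    = cor_mat[upper],
    stringsAsFactors = FALSE
  )
  out$p <- unname(mapply(
    function(a, b) cor.test(df[[a]], df[[b]])$p.value,
    out$row, out$column
  ))
  out[order(abs(out$cor), decreasing = TRUE), ]
}

cor_pairs <- flatten_cor(wine_head)
cor_pairs
```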
fixed.acidity, volatile.acidity, citric.acid and acid have relatively high correlations. This makes sense since they are all different types of acids and acid is the sum of the other three. Wines with a high level of fixed.acidity are more likely to have a high level of citric.acid and vice versa. The same applies to other pairs of acids. We also notice that volatile.acidity has a relatively strong correlation with wine quality, while the other three acid variables have weaker correlations with it.
pH has relatively strong correlations with the acid variables. pH is a numerical scale for acidity, so these correlations make sense. However, although the acid variables correlate with wine quality to some extent, pH doesn’t show a significant correlation. This raises a question: if A correlates with B and B correlates with C, does A correlate with C as well? In general it need not, since correlation is not transitive, which explains our results.
free.sulfur.dioxide, bound.sulfur.dioxide and total.sulfur.dioxide have high correlations, because the former two variables are subsets of the latter. bound.sulfur.dioxide and total.sulfur.dioxide have similar levels of correlation with wine quality. free.sulfur.dioxide has a much weaker correlation with quality.
density has relatively strong correlations with the acid variables except volatile.acidity. It is interesting to observe that the correlations between density and the other acid variables are all above 0.3, but density and volatile.acidity have an insignificant correlation.
density also correlates with alcohol, pH and residual.sugar to some extent. alcohol is an important variable in influencing wine quality.
chlorides and sulphates have a relatively high correlation. On the base 10 logarithmic scale, sulphates has a correlation of 0.31 with wine quality, while chlorides is weaker at -0.18.
fixed.acidity, citric.acid, volatile.acidity, acid
From the above diagrams, we see that the acid variables have relatively high correlations. They are all different types of acids in wine. It is also interesting to note that fixed.acidity and citric.acid are positively correlated, while the other two pairs of acids are negatively correlated.
acid vs. fixed.acidity, citric.acid, volatile.acidity
acid is the sum of the other three acid variables, so it makes sense that they are highly correlated with acid. The correlation between fixed.acidity and acid is very strong (0.996) because fixed.acidity is the main component of acid. We should also notice that volatile.acidity is negatively correlated with acid.
pH vs. acid, fixed.acidity, citric.acid, volatile.acidity
pH is a numerical scale to measure the level of acids and we can see that pH has relatively strong correlation with different types of acids variables. By definition of pH, we know that the higher level of acids, the lower pH value. We notice that volatile.acidity has positive correlation with pH which means the higher volatile.acidity level, the higher pH. It is possible that the pH level in wine is mainly determined by the other types of acids and the amount of volatile.acidity is too small to influence pH level.
density vs. acid, fixed.acidity, citric.acid, volatile.acidity
Similar to pH, density has relatively strong correlation with acids variables. All correlations are positive except the correlation with volatile.acidity.
density vs. alcohol, pH, residual.sugar
density also correlates with alcohol, pH and residual.sugar to some extent. The correlations between alcohol and density and between pH and density are negative, while the correlation between residual.sugar and density is positive.
chlorides vs. sulphates
sulphates and chlorides have correlation (0.37) to some extent. It is not very high in general but relatively high compared to other pairs of variables in our dataset.
free.sulfur.dioxide, bound.sulfur.dioxide, total.sulfur.dioxide
We can see that sulfur.dioxide variables have relatively strong correlations which makes sense since they are different types of sulfur.dioxide. All pairs of variables have positive correlations.
Note: The analysis has been provided with the bivariate plots.
fixed.acidity, volatile.acidity and quality
The diagram shows that although volatile.acidity is relatively strongly correlated with wine quality, fixed.acidity is not. Different acids play different roles in influencing wine quality.
pH, acid and quality
acid and pH are strongly correlated, which is in accordance with common sense. However, neither of them shows a strong correlation with quality.
density, volatile.acidity and quality
density doesn’t correlate with quality or volatile.acidity.
density, alcohol and quality
density correlates with alcohol to some extent. Wines of better quality tend to have higher alcohol levels. But density alone doesn’t influence quality much.
chlorides, sulphates
Although the correlation between sulphates and chlorides is 0.37, intuitively, it is interesting to see that they don’t show strong correlation in the diagram.
Note: The analysis has been provided with all the plots above.
The diagram above shows the influence of different types of acids on wine quality. In general, a higher level of acids is associated with higher wine quality. However, throughout our analysis, we found that volatile.acidity is negatively correlated with wine quality. In addition, among the acid variables, volatile.acidity has the strongest correlation with wine quality.
alcohol, volatile.acidity and sulphates_log10 are relatively strongly correlated with wine quality compared to other variables in our dataset. The above diagram shows the relations between these three variables and wine quality. Generally, higher alcohol level, lower volatile.acidity level and higher sulphates_log10 level lead to better red wines. However, the highest correlation we found in the variables with quality is alcohol 0.48, which is still not a very strong correlation.
The above diagram shows the relations between alcohol and volatile.acidity, both of which are highly correlated with wine quality. alcohol is positively correlated with quality and volatile.acidity is negatively correlated with quality. However, the two variables themselves are not correlated. In addition, we should notice that both correlations are not very strong.
In summary, from the whole analysis of the dataset, we conclude that the key properties in determining wine quality are alcohol, volatile.acidity and sulphates on the base 10 logarithmic scale. We should note that the correlations between these variables and wine quality are not very strong, and we cannot simply conclude that there are clear linear relations between them. Besides, the quality scores of all wines are based on subjective expert ratings, which may be biased. We cannot easily tell the quality of wines based on the conclusions from this dataset alone.
In addition, variables in our dataset can have more types of transformations and we may find more interesting relations between variables by further exploring the dataset. For example, ratios of different variables can be computed to see if they also influence the wine quality.
Throughout the analysis, the hardest part was finding a way to investigate the relations between variables. The idea we used is to compute correlations between all variables to see if there are strong ones. Based on the results, we can further explore the relations between variables. For example, inferential statistics can be used to infer from the dataset which properties determine wine quality. Besides, machine learning algorithms such as regression can be used to train a model on the key properties we found in order to predict the quality of wine. However, the correlation approach only considers relations between pairs of variables, so we cannot tell more complex relations among variables. What if the relations are not linear? What if we transform a variable to another scale? What if we combine two variables (multiply, divide, etc.)? There are still many possibilities.
Quality of wines is not easy to determine. However, it is a very interesting topic to look into in the future.